dplyr
By the end of the lecture, you will be able to …
Download and open code-along-03.qmd
Load the standard packages.
dplyr
Most tasks related to data analysis are not glorious or fancy.
A lot of time is dedicated to whipping a dataset into the shape needed to be able to analyze it.
This task goes by different names: “data cleaning,” “data management,” “data manipulation,” “data wrangling,” and “data transformation.”
dplyr package
The dplyr package provides a complete set of functions that help you solve the most common data manipulation challenges such as:
|>
The pipe operator passes what comes before it into the function that comes after it, as the first argument of that function.
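A minimal sketch of what "first argument" means in practice. It uses only base R (the native pipe needs R 4.1 or later), and the vector x is made up for illustration.

```r
# Made-up example vector
x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read from left to right.
# x becomes the first argument of sqrt(), and that result
# becomes the first argument of sum().
piped <- x |> sqrt() |> sum()

identical(nested, piped)  # TRUE
```

Both versions compute the same thing; the piped version lists the steps in the order they happen.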
dplyr style
In data transformation pipelines, always use a
|> |>
We’ll talk about data visualization pipes later…
Heads Up!
|> (the native pipe operator) and %>% (from the magrittr package) behave identically for simple cases.
function(argument)
Functions are (most often) verbs, followed by what they will be applied to in parentheses:
dplyr verbs (functions) will allow you to solve the vast majority of your data manipulation challenges.
dplyr basics
They are organized into four groups based on what they operate on: rows, columns, groups, or tables.
The verbs all have in common:
dplyr grammar
What’s the advantage of dplyr grammar? We can sequence data manipulation!
select(), filter(), and drop_na()
Use select() to pick specific columns from your dataset.
Use filter() to keep rows that meet a condition.
Use drop_na() to remove rows with missing (NA) values.
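The three verbs can be chained in one pipeline. This is a small sketch on a made-up tibble called toy (not the GSS data); the column names are assumptions for illustration only. Note that drop_na() comes from the tidyr package.

```r
library(dplyr)
library(tidyr)  # drop_na() lives in tidyr

# Hypothetical toy data, not the GSS
toy <- tibble(
  id   = 1:5,
  year = c(2020, 2022, 2022, 2022, 2022),
  age  = c(25, 31, NA, 19, 40),
  city = c("A", "B", "C", "D", "E")
)

result <- toy |>
  select(year, age) |>      # keep only two columns
  filter(year == 2022) |>   # keep only rows from 2022
  drop_na(age)              # drop rows where age is NA

result
```

The result has three rows and two columns: the 2020 row is filtered out and the row with a missing age is dropped.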
group_by() and summarize()
Use group_by() to organize your data into groups based on one or more variables.
Use summarize() to compute statistics like total, mean, or median for each group.
gss_all data frame:
# A tibble: 2 × 2
  sex         freq
  <dbl+lbl>  <int>
1 1 [male]    1031
2 2 [female]  1363
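A count-by-group table like the one above can be produced with group_by() followed by summarize(). The sketch below uses a made-up tibble, so the numbers will not match the GSS output; n() counts the rows in each group.

```r
library(dplyr)

# Hypothetical toy data, not the GSS
toy <- tibble(
  sex = c("male", "female", "female", "male", "female"),
  age = c(30, 25, 41, 52, 37)
)

by_sex <- toy |>
  group_by(sex) |>
  summarize(
    freq     = n(),        # number of rows in each group
    mean_age = mean(age)   # average age in each group
  )

by_sex
```

summarize() returns one row per group, here one for "female" and one for "male".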
dplyr in action
Compare the average and median age at first childbirth for U.S. men and women in 2022.
mutate() in action
Use mutate() to add new columns or change existing ones.
What proportion of new parents were teenagers (i.e., under 18 years old)?
gss_all |>
  select(year, agekdbrn) |>
  filter(year == 2022) |>
  drop_na(agekdbrn) |>                            # without this step, summarise() will report NA
  mutate(teen_parent = (agekdbrn < 18) * 1) |>    # multiplying by 1 turns TRUE/FALSE into 1s and 0s
  summarise(proportion = mean(teen_parent))
# A tibble: 1 × 1
  proportion
       <dbl>
1     0.0773
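Why does mean() give a proportion here? The mean of a column of 0s and 1s is the share of 1s. A small sketch on made-up ages (the tibble toy and its column name are assumptions, not GSS variables):

```r
library(dplyr)

# Hypothetical ages at first child, not GSS data
toy <- tibble(age_first_child = c(16, 22, 17, 30, 25, 19, 35, 28))

teen_summary <- toy |>
  mutate(teen_parent = (age_first_child < 18) * 1) |>  # 1 if under 18, else 0
  summarise(
    n_teen     = sum(teen_parent),   # how many 1s
    n_total    = n(),                # how many rows
    proportion = mean(teen_parent)   # same as n_teen / n_total
  )

teen_summary
```

Two of the eight made-up values are under 18, so the proportion is 2 / 8 = 0.25.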
mutate() with case_when()
Use case_when() inside mutate() to create values based on conditions.
What proportion of new parents had their first child as teenagers, in their 20s, 30s, or after age 40?
              Freq        %   % Cum.
----------- ------ -------- --------
        <18    186     7.73     7.73
      18–29   1704    70.82    78.55
      30–39    463    19.24    97.80
        40+     53     2.20   100.00
      Total   2406   100.00   100.00
Heads Up!
Overwriting datasets and variables can be intentional or unintentional.
Let’s make a tiny data frame to use as an example:
Suppose you run the following and then you inspect df.
Will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?
Do something and show me
Suppose you run the following and then you inspect df.
Will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?
Do something, save result, overwriting original
# A tibble: 5 × 2
      x y
  <dbl> <chr>
1     2 a
2     4 a
3     6 b
4     8 c
5    10 c
Do something, save result, overwriting original when you shouldn’t
Do something, save result, overwriting original data frame
Do something, save result, not overwriting original.
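The three patterns can be compared side by side. This is a sketch under assumptions: the starting tibble is reconstructed from the printed output above, and mutate(x = x * 2) stands in for "do something"; the names df_doubled and x_after_* are hypothetical.

```r
library(dplyr)

# Assumed starting point, reconstructed from the printed output
df <- tibble(x = c(1, 2, 3, 4, 5), y = c("a", "a", "b", "c", "c"))

# 1. Do something and show me: the result is printed, df is unchanged
df |> mutate(x = x * 2)
x_after_show <- df$x        # still 1, 2, 3, 4, 5

# 2. Do something, save result, not overwriting original
df_doubled <- df |> mutate(x = x * 2)
x_after_copy <- df$x        # still 1, 2, 3, 4, 5

# 3. Do something, save result, overwriting original
df <- df |> mutate(x = x * 2)
x_after_overwrite <- df$x   # now 2, 4, 6, 8, 10
```

Only the assignment arrow (<-) changes what is stored. Without it, the pipeline prints a result and the data frame keeps its old values.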
Do something and show me
gss_all |>
  select(year, agekdbrn) |>
  filter(year == 2022) |>
  drop_na(agekdbrn) |>
  mutate(age_groups = case_when(
    agekdbrn < 18 ~ "<18",
    agekdbrn >= 18 & agekdbrn <= 29 ~ "18–29",
    agekdbrn >= 30 & agekdbrn <= 39 ~ "30–39",
    agekdbrn >= 40 ~ "40+",
    TRUE ~ NA_character_)) |>
  group_by(age_groups) |>
  summarise(count = n()) |>
  # summarise() returns one row per group, so the share of the total
  # has to be computed afterwards, across all groups
  mutate(proportion = round(count / sum(count), 3))
Let’s use dplyr grammar to find the median and mode for the childs variable.
# zap_missing() and as_factor() come from haven; freq() and tb() come from summarytools
gss_all$childs <- zap_missing(gss_all$childs)
gss_all$childs <- as_factor(gss_all$childs)
gss_all$childs <- droplevels(gss_all$childs)
gss_all |>                               # dplyr grammar, starting with the name of the df and a pipe
  filter(year == 2024) |>
  freq(childs, report.nas = FALSE) |>    # freq() function as usual
  tb()                                   # tb() function to turn the table into a tibble
# A tibble: 9 × 4
  childs     freq    pct pct_cum
  <fct>     <dbl>  <dbl>   <dbl>
1 0          1029 31.4      31.4
2 1           484 14.8      46.2
3 2           851 26.0      72.1
4 3           475 14.5      86.6
5 4           243  7.41     94.0
6 5            96  2.93     96.9
7 6            53  1.62     98.6
8 7            16  0.488    99.1
9 8 or more    31  0.946   100
Let’s use dplyr grammar to find the median and mean for the hrs1 variable.
median(gss_all$hrs1, na.rm = TRUE)
mean(gss_all$hrs1, na.rm = TRUE)
# show me summary statistics
summary(gss_all$hrs1)
na.rm is a logical (TRUE or FALSE) indicating whether NA values should be stripped before the computation proceeds.
[1] 40
[1] 41.11279
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   37.00   40.00   41.11   48.00   89.00   32371
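The same statistics can be written in dplyr grammar with summarise(). A sketch on a made-up column (toy and hrs are hypothetical, not the GSS hrs1 variable) that also shows what na.rm changes:

```r
library(dplyr)

# Hypothetical hours worked, with one missing value
toy <- tibble(hrs = c(40, 35, NA, 50, 45))

# Without na.rm, a single NA makes both statistics NA
with_na <- toy |>
  summarise(median = median(hrs), mean = mean(hrs))

# With na.rm = TRUE, NA values are stripped before computing
without_na <- toy |>
  summarise(
    median = median(hrs, na.rm = TRUE),
    mean   = mean(hrs, na.rm = TRUE)
  )

with_na
without_na
```

With the four non-missing values 35, 40, 45, and 50, both the median and the mean are 42.5.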
descr()
Univariate statistics for numerical data
# A tibble: 1 × 9
  variable  mean    sd   min   med   max n.valid     n pct.valid
  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>     <dbl>
1 year      39.4  13.9     0    40    89    1768  1768       100
Heads Up!
descr() can’t handle grouped data. 😟
summarize()
# A tibble: 2 × 7
  `as_factor(sex)` count min       median max             mean    sd
  <fct>            <int> <dbl+lbl>  <dbl> <dbl+lbl>      <dbl> <dbl>
1 male               869 0             40 89 [89+ hours]  41.7  13.7
2 female             891 0             40 89 [89+ hours]  37.3  13.7
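A grouped table of this shape can be built by hand with group_by() and summarize(). The sketch below is one possible way, shown on made-up data (toy, sex, and hrs are assumptions); it is not necessarily the code that produced the output above.

```r
library(dplyr)

# Hypothetical toy data, not the GSS
toy <- tibble(
  sex = c("male", "male", "male", "female", "female", "female"),
  hrs = c(40, 50, NA, 30, 40, 20)
)

hrs_by_sex <- toy |>
  filter(!is.na(hrs)) |>   # drop missing hours so n() counts valid rows only
  group_by(sex) |>
  summarize(
    count  = n(),
    min    = min(hrs),
    median = median(hrs),
    max    = max(hrs),
    mean   = mean(hrs),
    sd     = sd(hrs)
  )

hrs_by_sex
```

Each statistic is computed separately within each group, which gives one row per value of sex.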
On average, in 2024, did parents with 4 or more kids work fewer hours for pay than other parents?
How do we find out?
Can you reproduce this table?
# A tibble: 4 × 5
  childs_group       count  mean median    sd
  <chr>              <int> <dbl>  <dbl> <dbl>
1 1 child              642  39.2     40  13.2
2 2 children           279  40.1     40  12.8
3 3 children           437  39.4     40  14.1
4 4 or more children   396  39.6     40  15.3
What’s your conclusion to our research question?